Sports competition tactical analysis model of cross-modal transfer learning intelligent robot based on Swin Transformer and CLIP

Introduction This paper presents an innovative Intelligent Robot Sports Competition Tactical Analysis Model that leverages multimodal perception to tackle the pressing challenge of analyzing opponent tactics in sports competitions. The current landscape of sports competition analysis necessitates a comprehensive understanding of opponent strategies. However, traditional methods are often constrained to a single data source or modality, limiting their ability to capture the intricate details of opponent tactics. Methods Our system integrates the Swin Transformer and CLIP models, harnessing cross-modal transfer learning to enable a holistic observation and analysis of opponent tactics. The Swin Transformer is employed to acquire knowledge about opponent action postures and behavioral patterns in basketball or football games, while the CLIP model enhances the system's comprehension of opponent tactical information by establishing semantic associations between images and text. To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias issues, thereby enhancing the model's generalization performance on multimodal data. Results Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is effectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights. Our method is rigorously tested and validated using Sport UV, Sports-1M, HMDB51, and NPU RGB+D datasets. Experimental results demonstrate the system's impressive performance in terms of prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity. Notably, the system outperforms other models, with a remarkable 8.47% lower prediction error (MAE) on the Kinetics dataset, accompanied by a 72.86-second reduction in training time. Discussion The presented system proves to be highly suitable for real-time sports competition assistance and analysis, offering a novel and effective approach for an Intelligent Robot Sports Competition Tactical Analysis Model that maximizes the potential of multimodal perception technology. By harnessing the synergies between the Swin Transformer and CLIP models, we address the limitations of traditional methods and significantly advance the field of sports competition analysis. This innovative model opens up new avenues for comprehensive tactical analysis in sports, benefiting coaches, athletes, and sports enthusiasts alike.

Introduction: This paper presents an innovative Intelligent Robot Sports Competition Tactical Analysis Model that leverages multimodal perception to tackle the pressing challenge of analyzing opponent tactics in sports competitions.The current landscape of sports competition analysis necessitates a comprehensive understanding of opponent strategies.However, traditional methods are often constrained to a single data source or modality, limiting their ability to capture the intricate details of opponent tactics.Methods: Our system integrates the Swin Transformer and CLIP models, harnessing cross-modal transfer learning to enable a holistic observation and analysis of opponent tactics.The Swin Transformer is employed to acquire knowledge about opponent action postures and behavioral patterns in basketball or football games, while the CLIP model enhances the system's comprehension of opponent tactical information by establishing semantic associations between images and text.To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias issues, thereby enhancing the model's generalization performance on multimodal data.
Results: Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is e ectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights.Our method is rigorously tested and validated using Sport UV, Sports-M, HMDB , and NPU RGB+D datasets.Experimental results demonstrate the system's impressive performance in terms of prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity.Notably, the system outperforms other models, with a remarkable .% lower prediction error (MAE) on the Kinetics dataset, accompanied by a .-second reduction in training time.

. Introduction
With the advancement of sports competition levels, indepth analysis of the opponent's tactics has become the key to winning games.A profound understanding of each other's strategies provides a more effective competitive strategy (Pan, 2022).However, current analysis methods are primarily based on a single data source, such as video replays or simple statistics, often failing to provide a comprehensive tactical portrait of the opponent.Additionally, traditional analysis methods often overlook the value of multi-modal data, such as text descriptions and athlete action data, which can offer rich contextual information for tactical analysis.Due to these limitations, current tactical analysis methods often fall short of meeting the demands of high-level competitive sports.With the rapid development of artificial intelligence technology, innovative, and practical research approaches in the field of tactical analysis have emerged, driving the development, and application of intelligent sports assistance (Olan et al., 2022).
In past research, scholars have explored different deep learning or machine learning models to construct sports competition tactical analysis model.For instance, Wenninger et al. (2020) employed Convolutional Neural Networks (CNNs) to recognize players' poses in basketball games, assisting coaches in tactical analysis and decision-making.However, this method exhibits limitations in handling complex scenarios and multimodal information, resulting in inaccuracies due to inadequate consideration of player interactions.To address these shortcomings, Tabrizi et al. (2020) proposed an improved LSTM model for intelligent robot motion assistance training system.Through the training and testing of the table tennis player's forehand hitting signal, the player's next hitting state is predicted.Although this method predicts the commonly used hitting state of players to a certain extent, it shows low efficiency when processing long sequences, and does not work well when processing large amounts of image data.
In recent years, researchers have explored the application of Transformer models in the Intelligent Robot Sports Assistant Training System.Yuan et al. (2021) introduced the Vision Transformer (ViT) model, transforming image data into sequences for processing and achieving excellent image feature representation.However, this method faced computational and storage resource pressures when dealing with largesized images, limiting its practical application in real sports competition scenarios.
To overcome these challenges, this paper proposes an intelligent robot sports competition tactical analysis model based on multi-modal perception.Firstly, we introduce the Swin Transformer (Liu et al., 2021) and CLIP models (Park et al., 2023) to achieve comprehensive observation and analysis of opponent tactics through multi-modal perception techniques.Secondly, we adopt cross-modal transfer learning (Wang and Yoon, 2021) to transfer opponent tactical information learned from images to the text modality, thereby enhancing the system's semantic understanding between images and texts.Finally, we establish a multi-modal tactical analysis and reasoning framework to predict opponent strategies and behavior patterns, providing coaches and athletes with richer and more accurate tactical decision support.
The contribution points of this paper are as follows: • Introducing multi-modal perception techniques to enhance observation and analysis of opponent tactics. .

Related work
Compared to methods based on graph node-edge processing (Yun et al., 2019;Kong et al., 2022) and multiview approaches, methods based on Graph Neural Networks (GNNs; Ning et al., 2023) directly utilize graphs to capture relationships and interactions between entities in a given domain.GNNs can be employed for comprehensive analysis of contextaware motion data (Sanford et al., 2020;Ning et al., 2023), providing a deeper understanding of opponent movements and deployed strategies, ultimately supporting better decisionmaking.They exhibit high flexibility and scalability (Victor et al., 2021), making them suitable for capturing complex and dynamic interactions in various sports competitions.However, the performance of GNNs heavily relies on the completeness and quality of graph data (Maglo et al., 2022).Without a clear, complete, and accurate graphical representation, the model may fail to capture key inter-entity relationships, which may make it difficult for the model to understand the tactical relationships between opposing players.
Recently, Generative Adversarial Networks (GANs) have shown significant potential in fields like computer vision, improving the performance of action recognition models through the generation of realistic synthetic motion videos (Wang et al., 2019).GANs have the capability to generate simulated game scenarios, demonstrating strong generalization ability (Dash et al., 2021;Hong et al., 2021), and providing valuable analysis imagery for tactical analysis.However, they suffer from the issue of "mode collapse, " where the generator may continuously produce highly similar outputs, limiting the diversity of generated data (Liu et al., 2020), which could hinder the understanding of tactical relationships between opposing players.
Furthermore, recent approaches utilize Transformer-like networks (Nweke et al., 2018) to capture critical information through self-attention, enabling them to capture temporal and spatial dependencies within video frames and enhance action recognition performance in dynamic motion scenes (Li et al., 2020).The Attention Mechanism for Action Recognition (ATTET) is capable of handling multi-modal input information and efficiently integrating information from these diverse sources (Pareek and Thakkar, 2021;Chen and Ho, 2022), further improving the accuracy of action recognition in sports videos.The temporal attention mechanism in ATTET ensures that the model focuses on the most relevant frames, making it robust to changes in ./fnbot. .action speed and duration commonly encountered in sports competitions (Chen et al., 2021;Yao et al., 2023).The spatial attention mechanism allows ATTET to selectively concentrate on relevant regions within video frames, effectively reducing noise and improving the model's discriminative power (Liu Z. et al., 2022;Li et al., 2023).However, introducing attention mechanisms in ATTET may increase computational complexity, limiting the model's transparency, especially in low-quality or complex background scenarios, making it challenging to apply the model to sports competition videos (Ma et al., 2022).

. Methodology . . Overview of our network
We propose an intelligent robot Sports competition tactical analysis model based on multimodal perception.The overall process, as depicted in Figure 1.This system leverages the Swin Transformer and CLIP models and employs cross-modal transfer learning to observe and analyze opponent tactics in sports competitions.The Swin Transformer is utilized to learn tactical information from opponent's dynamic video images, capturing their movement postures and behavior patterns in basketball or football games.Meanwhile, CLIP establishes semantic associations between images and texts in a shared latent space, enhancing the system's ability to understand and analyze the opponent's tactical information.By utilizing cross-modal transfer learning, the tactical information learned by Swin Transformer from images is effectively transferred to the CLIP model, providing coaches, and athletes with comprehensive tactical insights.
Swin Transformer, equipped with a layered attention mechanism, effectively captures local and global information from images, empowering robust feature extraction.In our system, Swin Transformer learns the opponent's movement postures and behavior patterns in basketball or football games.CLIP, pre-trained on a text description dataset, establishes semantic associations between images and text in a shared latent space.The CLIP model successfully maps images and texts to the same space, enabling semantic retrieval and matching.It plays a crucial role in fusing text information to further enhance the system's ability to understand and analyze opponent tactical information.Through cross-modal transfer learning, the opponent's tactical information learned from images by Swin Transformer is transmitted to the CLIP model, significantly enhancing CLIP's ability to understand the association between images and text.This process allows the system to more effectively analyze the opponent's tactical strategy and behavior patterns.
We integrate the trained Swin Transformer and CLIP models into an auxiliary training system for intelligent robot sports competitions.The system receives image and text data from sports competition scenes, extracting image features through Swin Transformer, and using CLIP to carry out semantic associations between images and text.This integration enables the system to effectively observe and analyze opponent tactics.By conducting comprehensive analysis of image and text data, coaches, and athletes gain valuable insights into the opponent's possible tactical strategies and behavior patterns, providing robust support for decision-making and response during competitions.

. . Swin transformer
Swin Transformer is a deep learning model based on the Transformer architecture, specifically designed for image processing tasks.In contrast to the traditional Transformer model, Swin Transformer introduces a layered image processing strategy utilizing block and window methods to efficiently handle large-size images (Li and Bhanu, 2023).This approach significantly improves calculation speed and performance, particularly when processing high-resolution images, while maintaining memory efficiency.An overview of the Swin Transformer process can be seen in Figure 2.
The fundamental principle of Swin Transformer lies in achieving image feature extraction and representation learning through a multi-layer Self-Attention mechanism.It divides the image into fixed-size blocks and conducts Self-Attention operations within each block to capture local image features.Subsequently, interaction between different blocks is achieved through windowing, enabling the extraction of global image features.This multi-layer process occurs within a Transformer encoder, gradually learning higher-level image representations.
Within the intelligent robot sports competition tactical analysis model, Swin Transformer is employed to learn the opponent's image tactical information, such as observing the opponent's movement posture and behavior patterns in basketball or football games.Leveraging its efficient and high-performance features, Swin Transformer adeptly processes a substantial amount of image data and extracts rich image features, providing robust support for observing and analyzing opponent tactics.
The Swin Transformer model is represented by the following Equation (Chen and Mo, 2023): Here, Query, Key, and Value represent the input Query, Key, and Value vectors, respectively.d k denotes the dimension of the Query and Key vectors, and softmax refers to the softmax function.Specifically, Query serves as the query vector to find the relevant Key and Value, while Key acts as the key vector to compute the relevance score between the Query and Key vectors.The Value vector is then weighted according to the relevance score to obtain the final output.
In the Swin Transformer, the Multi-head Self-Attention is a crucial step in implementing the self-attention mechanism.It calculates the relevance score between Query and Key vectors and then uses this score to perform a weighted average of the Value vectors, yielding the final output.Through multiple layers of self-attention operations, the Swin Transformer can capture both local and global features of images, achieving efficient, and accurate image feature extraction.The formula for Multi-head Self-Attention is as follows: The input Query, Key, and Value are represented as  the sequence length, and d q , d k , and d v represent the feature dimensions of Query, Key, and Value.The multi-head attention mechanism maps the input Query, Key, and Value to h subspaces, where self-attention calculations are performed for each subspace.Assuming the dimension of each subspace is d head = d k h , the computation formula for Multi-head Self-Attention is as follows: Here, head i = Attention(QW Q i , KW K i , VW V i ) represents the attention calculation for the i-th subspace, where head are the weight matrices for linear mapping in the i-th subspace, and W O ∈ R h•d head ×d v represents the output mapping weight matrix.
Attention(Q, K, V) denotes the standard Scaled Dot-Product Attention calculation, which is formulated as: In the Swin Transformer, the parallel computation through the multi-head attention mechanism effectively captures both local and global features of the image, thereby improving the efficiency and accuracy of feature extraction.
. .CLIP model CLIP (Contrastive Language-Image Pretraining); Tevet et al. ( 2022) is a multimodal learning model introduced by OpenAI.Its fundamental principle involves learning from the contrast between images and texts, enabling both modalities to share the same embedding space for cross-modal semantic understanding and matching (Wang et al., 2022).The primary objective of CLIP is to bring images and texts from the same semantic category closer together in a shared embedding space, while keeping images and texts from different semantic categories farther apart.This semantic alignment allows CLIP to convert and match images and texts with each other effectively.An overview of the CLIP process can be seen in Figure 3.
In the intelligent robot Sports competition tactical analysis model, CLIP plays a crucial role in aligning the opponent's tactical information learned from images with the tactical information acquired from text, thereby enhancing the understanding and analysis of opponent tactics.Leveraging CLIP, intelligent robots can achieve semantic matching between images and texts, enabling speculation on possible tactical strategies and behavior patterns of opponents.Consequently, coaches and athletes receive richer and more accurate tactical decision support.The formula for CLIP is as follows: The input image feature is represented by I ∈ R N×d , where N denotes the number of images, and d represents the dimension of the image feature.Similarly, the input text features are denoted by T ∈ R M×d , where M signifies the number of texts, and d indicates the dimension of the text features.CLIP aims to minimize the contrast loss between images and texts, facilitating the proximity of images and texts from the same semantic category in the embedding space, and ensuring a larger distance between images and texts from different semantic categories.
For image and text features, CLIP employs a standard contrastive loss function as follows (Shen S. et al., 2021): the cosine similarity between the image feature I i and the text feature T j .
CLIP minimizes the contrastive loss function to optimize the semantic matching between images and texts.This process ensures a reasonable distribution of distances between them in the shared embedding space, facilitating semantic alignment, and matching of multimodal information.

. . Cross-modal transfer learning
Cross-modal transfer learning is a form of multi-modal transfer learning method that leverages a shared model, such as an image-based model, for knowledge transfer to enhance high-level modeling capabilities and performance (Zhen et al., 2020).An overview of the Cross-Transfer Learning process can be seen in Figure 4.
The image features are denoted as I ∈ R N×d 1 , where N is the number of displayed images, and d 1 represents the image feature dimension.The text features are represented by T ∈ R M×d 2 , where M is the number of displayed texts, and d 2 signifies the text feature dimension.As both image and text features can be represented in a shared embedding space, a linear projection matrix W ∈ R d 1 ×d 2 is used to map image features to text feature space.By minimizing the distance between the projected image features and the original text features.
The Cross-modal transfer learning loss function is defined as follows: (5) Here, I i represents the feature of the i-th image, T ij represents the feature of the ith text and the feature of the jth image, and W is the linear mapping matrix to be learned.
By minimizing the Cross-modal transfer learning loss function, feature transfer from images to texts is achieved, enhancing the system's semantic understanding of the relationship between images and texts.Consequently, this improvement enhances the observation and analysis capabilities of the intelligent robot sports assistant training system regarding opponent tactics.

. Experiment . . Datasets
This section provides an overview of the datasets used in the cross-modal transfer learning algorithm, along with details of their preprocessing.
The Sports-1M dataset (Carreira and Zisserman, 2017) represents an extensive collection of sports video clips covering a wide array of sports disciplines, including basketball, football, soccer, tennis, among others.Its primary utility lies in facilitating tasks related to sports action recognition, behavior analysis, and tactical comprehension.Each video clip within the dataset encompasses diverse sports actions, such as passing, shooting, running, defending, and more.Preliminary data preprocessing steps involved the selection of relevant clips and standardization of resolution and format.The Sports-1M dataset serves as a foundational resource for the cross-modal transfer learning algorithm, offering a diverse range of sports scenarios and annotated actions for model training and evaluation.
The SportVU dataset (Korbar et al., 2019) is an extensive sports tracking dataset that leverages high-resolution cameras and multi-sensor tracking technology.Although it covers various sports, the primary focus centers on basketball due  to its comprehensive representation within the dataset.In preparation for the cross-modal transfer learning algorithm, detailed information pertaining to player positions, movements, trajectories, velocities, accelerations, and more was extracted from the raw data.This involved meticulous data alignment and synchronization procedures.The dataset's high spatial and temporal resolution empowers fine-grained analyses of player actions, tactical patterns, and team strategies, .
/fnbot. .making it a valuable asset for the cross-modal transfer learning approach.
The NPU RGB+D dataset (Yang et al., 2019) is a unique multimodal dataset that combines RGB (color) and depth information for sports action analysis across various sports, including basketball and football.Data preparation steps encompassed the synchronization of RGB videos with corresponding depth maps, ensuring temporal alignment.The incorporation of depth data within the dataset enhances the accuracy and robustness of action recognition algorithms.The NPU RGB+D dataset plays a pivotal role in the study, enabling exploration of the potential of depthbased features within the domain of sports-related tasks.

. . Experimental details
In this paper, four data sets are selected for training, and the training process is as follows: Step1: Data preprocessing In the sports competition video dataset, the presence of noisy, missing, or inconsistent data is common.Data cleaning is essential to address these issues and ensure the overall quality and consistency of the dataset.This involves deduplicating records, handling missing values, and correcting data errors, among other tasks.Additionally, the data is formatted into a standardized structure to facilitate subsequent processing and model training.
Sports competition video data typically encompass a wealth of both image and text information.During the data preprocessing stage, relevant features must be extracted from the raw data and preprocessed to meet the model's input requirements.For image data, techniques such as image enhancement, cropping, and scaling are employed to derive valuable image features.Similarly, text information undergoes processing steps such as text cleaning, word segmentation, and encoding to facilitate subsequent cross-modal transfer learning and model training.
By addressing these two aspects of data preprocessing, we can ensure data quality and availability, providing well-suited inputs for subsequent model training and evaluation.This proves to be pivotal in constructing an effective auxiliary training system for intelligent robot sports competitions.
Step2: Model training Upon defining the architecture of the combined model, we proceed with the model training process.This comprehensive procedure involves loading the pre-trained parameters of the Swin Transformer and CLIP modules, as well as performing cross-modal transfer learning.
Initially, the pre-trained parameters of both the Swin Transformer and CLIP modules are loaded.These models have undergone training on large-scale datasets, acquiring rich representations from images and text.Subsequently, we meticulously prepare the training dataset, encompassing both image and text data, and ensure proper formatting and preprocessing before feeding it into the model.In this crucial step, we execute cross-modal transfer learning, unifying knowledge from the Swin Transformer and CLIP modules.Specifically, the Swin Transformer processes image data, while the CLIP module processes text data.The outputs from both modules are then fused and mapped, generating a joint representation that amalgamates image and text information.
Once the cross-modal transfer learning is accomplished, we compile the combined model with an appropriate loss function, optimizer, and evaluation metrics.The loss function serves as a guide, minimizing the discrepancy between predicted outputs and ground truth labels during training.The training process utilizes the prepared dataset to train the combined model.During this phase, the data flows through the Swin Transformer and CLIP modules, as well as the cross-modal transfer learning module.The resulting outputs are then combined and forwarded to the output layer for prediction.Finally, upon completing the training process, the trained combined model is saved to disk for subsequent use in sports competition assistance and analysis.
Step3: Model Evaluation After completing the model training, the next crucial step is the comprehensive evaluation of the model's performance.We employ a range of metrics to assess the accuracy and stability of the model's predictions.The key evaluation metrics include Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE %), Root Mean Squared Error (RMSE), and Mean Squared Error (MSE).These metrics enable us to quantify the prediction errors and provide valuable insights into the model's predictive capabilities.
In addition to the above metrics, we also measure the model's training time, which denotes the duration required to train the model on the training dataset.Furthermore, we evaluate the inference time, which represents the time taken by the model to make predictions on new data or perform inference tasks.These time measurements offer valuable information about the model's efficiency in real-time applications.Moreover, we assess the model's parameter count, which indicates the number of learnable parameters in the model.A lower parameter count suggests a more compact and potentially more interpretable model.Lastly, we analyze the computational complexity, which gives us insights into the amount of computational resources required during both model training and inference.Lower computational complexity signifies higher efficiency and scalability, making the model more feasible for practical deployment.
By conducting a comprehensive evaluation with a diverse set of metrics, we gain a thorough understanding of the model's performance, robustness, and efficiency, enabling us to make informed decisions for sports competition assistance and analysis applications.
Step4: Result analysis The experiments encompassed a comparison of different models, including Swin Transformer, CLIP, and the crossmodal transfer learning model.Several evaluation indicators were employed to assess the models' performance, namely, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE %), Root Mean Squared Error (RMSE), and Mean Squared Error (MSE).
The meaning and formulas of these evaluation indicators are as follows: 1. MAE (Mean Absolute Error): MAE measures the average absolute difference between the predicted values ( ŷi ) and the true values (y i ).It evaluates the model's prediction accuracy.

Method Dataset
SportVU dataset (Carreira and Zisserman, 2017) Sports-M dataset (Korbar et al., 2019) NPU RGB+D dataset (Wu et al., 2020) Parameters  RMSE represents the square root of the average of the squared differences between the predicted values and the true values.It measures the model's prediction stability.
4. MSE (Mean Squared Error): MSE calculates the average of the squared differences between the predicted values and the true values.It provides insights into the model's prediction accuracy and stability.
The impact of this research is significant for the application of intelligent robot sports competition assistant training systems.By effectively leveraging multi-modal perception, the proposed models have the potential to improve tactical analysis, behavior recognition, and overall performance assessment in various sports competitions such as basketball and football.The fusion of visual and textual information enhances the models' ability to understand opponents' tactics and strategies, providing valuable support for coaches, athletes, and analysts in their decisionmaking processes.
Algorithm 1 represents the algorithm flow of the training in this paper:

. . Experimental results and analysis
This study aims to investigate an intelligent robot Sports competition tactical analysis model.By integrating the Swin Transformer and CLIP models through cross-modal transfer learning, we can enhance the system's ability to analyze tactics and predict opponents' behaviors in sports competitions.The experiment utilizes multiple datasets, including SportVU, Sports-1M, and NPU RGB+D Dataset, to compare and analyze the performance of different models on these datasets.The comparison metrics include the number of model parameters, floating-point operations (FLOPs), inference time, and training time.
The experimental results are shown in  2023)'s method excels on the Sports-1M Dataset, with a small number of parameters and inference time, but shows relatively inferior performance on other datasets.The approach of Shen Z. et al. (2021) performs better on the SportUV Dataset but falls short compared to our method on other datasets.Vajsbaher et al. (2020)'s method shows promise on the NPU RGB+D Dataset, but its training time is longer.Liu Y. et al. (2022)'s method performs well in FLOPs but lacks in other performance metrics.Tao et al. (2020)'s method performs well on the Sports-1M Dataset and NPU RGB+D Dataset, but struggles on other datasets.Similarly, Ji et al. (2019)'s method delivers strong results on the Sports-1M DataseT, but is mediocre on other datasets.
Based on the comparative results and experimental principles, our method leverages cross-modal transfer learning to combine the Swin Transformer and CLIP models, enabling us to jointly infer the opponent's tactical strategy and behavior patterns from image and text information.Our model demonstrates robust performance across multiple datasets, indicating its versatility and scalability for Sports competition tactical analysis model.
This experiment aims to investigate the auxiliary effect of the intelligent robot sports competition training system by conducting a comprehensive comparison of different models on multiple datasets.The evaluation is based on several key metrics, including Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Mean Square Error (MSE), to assess the model's performance across diverse datasets.
The experimental results, presented in Table 2, demonstrate that our proposed method excels in all metrics, showcasing both high accuracy and stability.In comparison to alternative methods, our model outperforms in indicators such as MAE, MAPE, RMSE, and MSE, indicating its superior ability to predict sports competition outcomes with enhanced reasoning and predictive capabilities.Furthermore, our model consistently maintains low error levels across various datasets, confirming its robustness and adaptability to different scenarios.The visualization results of Table 2 are shown in Figure 6.
Notably, our method exhibits exceptional performance on the Sports-1M dataset, a large-scale sports video dataset featuring complex sports scenes and diverse actions, wherein our model achieves the minimal error value.This success underscores our model's adaptability and generalization capabilities for challenging sports competition scenes.
Additionally, our method not only excels in accuracy but also achieves significant advantages in computational efficiency.Compared to other comparative methods, our model offers ample room for optimization in terms of model parameters, FLOPs, and inference time, thereby providing efficient performance in computationally demanding environments.This efficiency translates into robust support for rapid training and real-time applications in practical sports competition scenarios.
As shown in Table 3 and Figure 7, we investigated the impact of the Swin-transformer module on the performance of the   metrics.These results validate the effectiveness of the Swintransformer module in accurately predicting sports competition outcomes and enhancing reasoning and prediction capabilities.Moreover, the module maintains consistently low error levels on different datasets, affirming its robustness, and adaptability across various sports competition scenarios.
Of particular note is the exceptional performance of the Swintransformer module on the Sports-1M dataset, characterized by complex sports scenes and diverse actions, where it achieved the lowest error values.This further supports the module's adaptability and generalization abilities in handling challenging sports competition scenes.Beyond its superior accuracy, the Swintransformer module also demonstrates significant advantages in computational efficiency.Compared to alternative models, such as ViT, EfficientNet, and ResNet50, our module presents ample room for optimization in terms of model parameters, FLOPs, and inference time.This computational efficiency is essential for fast training and real-time applications in practical sports competition scenarios.
Table 4 presents   Figure 8 visually represents the trends and highlights the significant efficiency and effectiveness advantages of the Swin-transformer module in the ablation experiments.The results indicate that the Swin-transformer module achieves impressive performance while requiring fewer parameters and computational resources, making it a highly efficient and effective choice for various sports-related applications.

. Conclusion and discussion
This study proposes an intelligent robot sports competition tactical analysis model based on multimodal perception.The experimental section of the article evaluates the performance of several state-of-the-art models on four sports competition datasets.The experimental results indicate that Tang et al.'s method performs well on the Sports-1M dataset, while Shen et al.'s method excels on the SportVU dataset.Furthermore, although our model does not significantly outperform these advanced models in terms of performance, it exhibits strong advantages in terms of inference time and training time.The Swin-transformer module in our model performs exceptionally well in ablation experiments, confirming its effectiveness in enhancing model performance.
In conclusion, this paper has introduced an intelligent robot sports competition tactical analysis model based on multimodal perception.Leveraging Swin Transformer and CLIP models along with cross-modal transfer learning, this system observes and analyzes opponent tactics in sports competitions.The proposed method has shown promise, demonstrating high prediction accuracy, efficiency, and suitability for real-time sports competition assistance and analysis.
However, the development and application of such technology come with ethical responsibilities.It is imperative to obtain informed consent, safeguard privacy, and address modality bias in data representation.Responsible resource allocation is necessary to ensure accessibility, particularly in resource-constrained settings.The introduction of real-time interaction capabilities should prioritize the integrity of sports competitions and inclusivity for all stakeholders.
Looking forward, there are exciting opportunities for further research to enhance the model's capabilities while addressing its limitations.These include mitigating modality bias, expanding the model's ability to process diverse data, improving efficiency, and exploring real-time feedback mechanisms.Additionally, integrating domain-specific knowledge and investigating human-robot collaboration in sports analysis present intriguing avenues for future work.Overall, this research contributes positively to the advancement of intelligent the editors and the reviewers.Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

FIGURE
FIGUREOverall flow chart of the model.

FIGURE
FIGURE Flow chart of the Swin Transformer model.(A) Architecture of Swin Transformer (Swin-T).(B) Two consecutive Swin Transformer blocks.D W-MSA and D SW-MSA are multi-head self-attention modules with regular and shifted window configurations respectively.

FIGURE
FIGUREFlow chart of the CLIP model.

FIGURE
FIGUREFlow chart of the Cross-Transfer Learning model.
7) MAPE calculates the average absolute percentage error between the predicted values and the true values.It assesses the model's relative accuracy.Input: Training dataset: Sports-1M, SportVU, NPU RGB+D Dataset Output: Trained SC-Transfer Net Initialize Swin Transformer model parameters; Initialize CLIP model parameters; Initialize cross-modal transfer learning model parameters; Initialize optimizer and loss function; while not convergence do for each batch in the training dataset do Retrieve image and text data from the batch; Save the trained SC-Transfer Net to disk; Algorithm .SC-transfer net training.

FIGURE
FIGUREAblation experiments on the Swin-transformer module.
the results of the ablation experiments on the Swin-transformer module.The experiments aimed to analyze the impact of the module on various performance metrics across different datasets.Four key metrics, namely Parameters (M), FLOPs (G), Inference Time (ms), and Training Time (s), were considered to assess the efficiency and effectiveness of the models.In the SportVU dataset, the Swin-transformer module demonstrated remarkable performance, achieving a reduced number of parameters (193.19M) and FLOPs (230.56G) compared to other methods such as ViT (393.33 M, 343.10 G) and EfficientNet (367.31M, 251.95 G).It also exhibited lower inference time (149.09ms) and training time (140.12s) compared to its counterparts.Similar trends were observed in the Sports-1M dataset, where the Swin-transformer module outperformed the other models in terms of all metrics, including Parameters (106.37 M), FLOPs (166.73G), Inference Time (210.37 ms), and Training Time (185.59 s).Moreover, on the HMDB51 dataset, the Swin-transformer

FIGURE
FIGUREAblation experiments on the Swin-transformer module.

•
Adopting cross-modal transfer learning to improve the semantic understanding between images and texts.
Table 1, our proposed method performs exceptionally well across multiple metrics.Compared to other comparative methods, our model significantly reduces the number of model parameters, FLOPs, and inference , indicating its advantages in complexity and computational efficiency.Moreover, our method exhibits fast training time, a critical factor for rapid training and real-time applications.The visualization results of Table 1 are shown in Figure 5.Among the comparison methods, Tang et al. ( time TABLE Comparison of di erent metrics for di erent models.
FIGUREComparison of di erent metrics for di erent models.
TABLE Comparison of di erent metrics for di erent models.
continued to showcase superior efficiency with lower Parameters (192.90M) and FLOPs (165.18G) compared to ViT and EfficientNet.The Inference Time (233.64 ms) and Training Time (152.24 s) of the Swin-transformer module were also lower than its competitors.Similarly, in the UCF101 dataset, the Swin-transformer module outperformed the other models, with the lowest Parameters (153.47M), FLOPs (158.01G), Inference Time (208.28 ms), and Training Time (144.80 s). module